ABSTRACT


Determining the effectiveness of a computer simulation model in 
duplicating a desired real world phenomenon is an important unsolved 
problem. The purpose of this paper is to model the validation pro- 
cedure in a broad context and develop a general methodology for the 
statistical part of validation. A procedure calling on utility, 
decision, simulation, and statistical theories is developed. The 
goals of statistical testing are presented, and the assumptions, prop-
erties, and results of several parametric and nonparametric tests are
discussed and compared.
I. INTRODUCTION


With the advent of complex computer simulations of real world phe- 
nomena, a means of judging the worth or validity of the simulation has 
become very important, yet to associate with any simulation model a 
strict valid-invalid judgement is quite misleading. Models can be con- 
sidered valid under certain circumstances and invalid under others, or 
when compared using different criteria they may be considered first 
valid then invalid. 

In the past one of the largest problems of model validation has been 
the definition of the term. Normally what has been thought of as model 
validation is the statistical testing of collected data and the compari- 
son of the results with a predetermined test criterion. For this reason 
validation has come under attack for being a form of statistical chica- 
nery and merely a means to add credence to an already accepted model. 

The purpose of this paper is to redefine the validation procedure 
in a larger context and to describe some of the problems, techniques, 
and assumptions associated with the statistical portion of validation. 
It is hoped that with this procedure model validation will become a more
definitive process and that the mystical air normally associated with
statistical testing procedures will be removed.


II. VALIDATION PHILOSOPHY


The present procedure of validation is basically as follows. After 
the requirement to validate a model has been given, agreement on a sig- 
nificance level for statistical testing is reached. The data is then 
given to a statistician to find an appropriate testing procedure. Using 
the predetermined level of significance it is then determined whether the 
difference between the real world and model data is significant. The 
decision-making procedure is quite simple: if the results of the test
indicate a significant difference then the model is said to be invalid.
If the difference is not significant then it is considered valid. This
procedure is shown in Figure 1 and is the one used in a recent
validation [12].*

This type of apparently straightforward validation has two basic
problems. The first is that the decision rule, while seemingly well
defined, is actually more complex and involves such things as cost and
utility models as well as statistical theory. As an example, in a validation done by the
Systems Analysis Group [23] the level of significance was set at a level 
of .5 in order to make the probability of accepting an invalid model 
small. This decision must have involved consideration of the costs of 
accepting an invalid model and of rejecting a valid model. After com- 
piling these costs the principles of utility theory must have been used 
in arriving at a figure of .5 as the best level of significance. But, 
none of these considerations were mentioned in the report of the valida-
tion. So in the past, and even on present validation projects, the
decision rules are not fully explained, but should be if meaningful
validation is to be achieved. The second problem is that much confusion
exists in the method of selection of a particular statistical test. In
very few cases, if ever, will a test be perfect for the data. An assump-
tion will therefore be relaxed slightly to make use of a strong property
of a test, yet if another test were used that assumption might not have
to be violated. The question of which assumptions and properties of a
test are most important is very complex and the answers are not clearly
defined. Thus the properties and goals of statistical tests must be
more completely defined. It must also be realized that while the passing
or failing of a single test, or at the most of a few tests, constitutes a
decision rule now, the results of these tests should only correspond to
a single element of an n dimensional decision vector.

* Number in brackets is reference number as listed on pages 50 and 51.

FIGURE 1

A PRESENTLY USED MODEL VALIDATION SCHEME

The philosophy of the present validation procedure is sound. What is 
proposed is a new procedure, rather than philosophy, directed at allowing 
the decision maker more flexibility. This involves describing decisions 
in terms of utility, simulation, and decision theory as well as just 
collected data and statistics. 

The procedure can be thought of as the interchange of information 
between three modules, Simulation Theory, Statistical Theory, and 
Decision-Utility Theory. Within each module there are several nodes 
such as the testing node in the statistical module. Several nodes such 
as the criterion node share modules. In general the procedure would work 
as follows. Information concerning the model and real world, such as
data, flows from the state of nature node through the validation informa-
tion node and into the decision node. From the decision node several
paths exist. The validation may be terminated by accepting or rejecting
the model, the model may be temporarily rejected while further data is
gathered or additional comparisons conducted, or the decision may be to
compare the information by means of a statistical test. Regardless of
the choice, if the validation is not terminated more information enters
the validation information node as a result of the decision and the pro-
cess will continue in a cycling manner until the decision to terminate
the validation is given. Figure 2 illustrates the concept of modules and
nodes interacting to form a validation procedure.
III. VALIDATION DECISIONS


There are four possible outcomes of any decision rule. These are
based on the two decisions:

    $D_0$ = Decision the model is valid
    $D_1$ = Decision the model is invalid

and the two possible underlying states of nature:

    $S_0$ = The model is valid
    $S_1$ = The model is invalid.

One possible outcome is $D_0 S_1$, or the incorrect decision that the model is
valid when it is invalid. The other outcomes are $D_0 S_0$, $D_1 S_0$, and $D_1 S_1$.
Why the decision maker makes a particular decision depends upon the
decision rule he is using. As an example of a simple decision rule con-
sider a validation procedure which is similar to the one presently being
used. It consists of one model, the two states of nature $S_0$ and $S_1$, and
the two decisions $D_0$ and $D_1$. The statistical test is exact, that is:

    $P(X^* = x_0 \mid S_0) = 1$
    $P(X^* = x_1 \mid S_1) = 1$

or when the state of nature, S, is $S_0$ then the test statistic $X^*$ is $x_0$
and similarly when $S = S_1$ then $X^* = x_1$. The simple decision rule is:
if $X^* = x_0$ then $D = D_0$, and if $X^* = x_1$ then $D = D_1$. Again note that this
is basically what is done in present validations. A test is performed
and according to the results of the test alone the model is said to be
valid or invalid.

Expanding the above, consider the following procedure consisting of
the same model and states of nature. The test is no longer exact. Now,
when $S = S_0$, $X^*$ is a random variable with density function $p_0(x)$ and when
$S = S_1$ then $X^*$ is a random variable with density function $p_1(x)$. $X^*$
represents the test statistic whose range is the real line X. Since $X^*$
is now a random variable the decision rule may become more complex but
the possible results are still the same.
RESULTS OF DECISION RULES AND THEIR PROBABILITIES


The probabilities of making the decisions are:

    $\alpha$ = Probability of rejecting a valid model
    $\beta$ = Probability of accepting an invalid model
    $1-\alpha$ = Probability of accepting a valid model
    $1-\beta$ = Probability of rejecting an invalid model.


As an example of another decision rule consider a case where

    $p_0(x)$ is normal $(0, \sigma^2)$
    $p_1(x)$ is normal $(\mu, \sigma^2)$, $\mu > 0$.

Let $X_0$ be that portion of the real line, X, such that all points in $X_0$
are less than or equal to $x^{\circ}$, while the other points constitute $X_1$; thus:

    $X_0 = \{x : x \le x^{\circ}\}$
    $X_1 = \{x : x > x^{\circ}\}$

$x^{\circ}$ is an arbitrary point and can be determined either before or after
data is collected depending upon the decision rule. If $x^{\circ}$ is deter-
mined before the test is performed then the decision rule might be:
if $X^* \in X_1$ then $D_1$; otherwise $D_0$. With this decision rule the
corresponding decision probabilities are:

    $p(D_0 \mid S_1) = \beta = \int_{X_0} p_1(x)\,dx$ = probability of accepting an invalid model

    $p(D_1 \mid S_0) = \alpha = \int_{X_1} p_0(x)\,dx$ = probability of rejecting a valid model

    $p(D_0 \mid S_0) = 1-\alpha = \int_{X_0} p_0(x)\,dx$ = probability of accepting a valid model

    $p(D_1 \mid S_1) = 1-\beta = \int_{X_1} p_1(x)\,dx$ = probability of rejecting an invalid model

Figure 4 shows the functions graphically with $x^{\circ}$ and the probabili-
ties $\alpha$ and $\beta$.
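The four integrals above reduce to normal tail areas in this simplified case. The following is a minimal sketch under that assumption; the function names and the illustrative values of the cutoff, the mean, and the standard deviation are hypothetical and are not taken from the validation data.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal(mean, sd**2) variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def error_probabilities(cutoff, mu, sigma):
    """Return (alpha, beta) for the rule: decide D1 (invalid) when X* > cutoff.

    Assumes p0 is normal(0, sigma**2) and p1 is normal(mu, sigma**2), mu > 0.
    """
    alpha = 1.0 - normal_cdf(cutoff, 0.0, sigma)  # area of p0 over X1
    beta = normal_cdf(cutoff, mu, sigma)          # area of p1 over X0
    return alpha, beta


# Hypothetical values: cutoff halfway between the two means.
alpha, beta = error_probabilities(cutoff=1.0, mu=2.0, sigma=1.0)
print(round(alpha, 4), round(beta, 4))
```

Moving the cutoff to the right lowers the probability of rejecting a valid model and raises the probability of accepting an invalid one, which is the trade-off the decision rule has to settle.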
FIGURE 4

GRAPHIC DISPLAY OF POSSIBLE DISTRIBUTIONS OF TEST STATISTIC
ASSUMING $p_0(x) = n(0,\sigma^2)$ AND $p_1(x) = n(\mu,\sigma^2)$
It should be noted that Figure 4 represents a much simplified pair
of density functions. The nature of validation and the associated test
statistics often prevents any knowledge of the exact distribution of
$p_1(x)$. In only one of the tests performed in this paper can $\beta$ be
readily determined. Another complicating feature of validation is that
$p_1(x)$ usually flanks $p_0(x)$ or even overlaps $p_0(x)$ over its entire
domain. Figure 5 shows these possibilities.





FIGURE 5 


POSSIBLE INTERACTIONS BETWEEN $p_0(x)$ AND $p_1(x)$


In both parts a) and b) of Figure 5 the previous decision rule could 
have been used, but by associating costs with the various results of a 
decision rule, a payoff matrix can be formed and another type of deci- 
sion rule employed. 

Utility theory and cost structures will determine the values of the
decision matrix but in general the $S_1 D_0$ element will represent the
highest cost, for it represents the acceptance of an invalid model and
thus all subsequent decisions based on the assumption that the model is
valid will also be in error. $S_0 D_0$ will usually cost the least but the
ordering of $S_1 D_1$ and $S_0 D_1$ will vary depending on the costs of additional
experimentation and realignment of the model.


With the costs of each decision result determined, and the decision
maker willing to accept a priori knowledge of $P(S = S_0)$ and $P(S = S_1)$,
then he can, by using a rule such as minimax, choose a decision which will
minimize cost. If through information received from the statistical
module he is willing to accept values of $\alpha$ and $\beta$ then the costs of the
decisions will become expected costs and by using the same minimax deci-
sion rule the minimum expected cost can be found. Thus with this deci-
sion rule, the relationship between the statistical and decision modules
can be seen as one in which additional information received from the
data is transmitted to the decision maker allowing him access to more
information about the model and helping him to refine his decision.
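How the values of $\alpha$ and $\beta$ turn costs into expected costs can be sketched as follows. This is only an illustration: the cost figures, the three candidate policies, and the function names are assumptions introduced here, and both a minimax choice and a minimum expected cost choice (using the a priori probabilities) are shown since the text mentions both ideas.

```python
def conditional_costs(cost, alpha, beta):
    """Expected cost of each policy under each state of nature.

    cost[(state, decision)] is the (hypothetical) cost of that outcome.
    """
    return {
        "always_accept": {"S0": cost["S0", "D0"], "S1": cost["S1", "D0"]},
        "always_reject": {"S0": cost["S0", "D1"], "S1": cost["S1", "D1"]},
        "follow_test": {
            "S0": (1 - alpha) * cost["S0", "D0"] + alpha * cost["S0", "D1"],
            "S1": beta * cost["S1", "D0"] + (1 - beta) * cost["S1", "D1"],
        },
    }


def minimax_policy(risks):
    """Policy whose worst-case expected cost over the states is smallest."""
    return min(risks, key=lambda name: max(risks[name].values()))


def min_expected_cost_policy(risks, prior_valid):
    """Policy with the smallest cost averaged over the a priori probabilities."""
    def expected(name):
        return (prior_valid * risks[name]["S0"]
                + (1 - prior_valid) * risks[name]["S1"])
    return min(risks, key=expected)


# Hypothetical payoff matrix: accepting an invalid model costs the most.
costs = {("S0", "D0"): 0.0, ("S0", "D1"): 4.0,
         ("S1", "D0"): 10.0, ("S1", "D1"): 2.0}
risks = conditional_costs(costs, alpha=0.1, beta=0.2)
print(minimax_policy(risks), min_expected_cost_policy(risks, prior_valid=0.5))
```

With these assumed numbers, following the test has the lowest cost under both criteria, which illustrates how statistical information can refine the decision.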

The choice of which decision rule to use is a subject in itself and 
is left for future study. Regardless of the method chosen though, the 
value of statistical information from the data is apparent. 

How to realign the simulation model, when the decision to reject it
is made, is the subject of simulation theory. This also is a complex
field and is left for future study.


IV. STATISTICAL THEORY MODULE AND TESTS 


Two measures of the probability of rejecting a valid model are $\alpha$ and
P, where P is determined by finding the largest value of $\alpha$ for which the
null hypothesis that the model and real world are sampling from the same
distribution can be accepted given the test statistic. Thus P repre-
sents the value of $\alpha$ at which the decision concerning the null hypothe-
sis passes from acceptance to rejection and is determined after the
test statistic is computed, whereas $\alpha$ is arbitrarily predetermined and
used to compare with the results of the test. When comparing different
tests, two approaches could be used. Either an $\alpha$ level of significance
could be predetermined and each test receive a pass or fail rating, or
a P value could be determined. For more sensitive comparisons P values
will be determined when applying data to tests in the following
sections.

The statistical module of model validation operates as follows.
Real world data is compared with model simulation data by one of many
statistical tests. For each test the $p_0(x)$ is known and the value of
the test statistic computed. Given $p_0(x)$ and the test statistic the P
value is determined along with $\beta$ if $p_1(x)$ is known. This information
is then passed into the test results node for further transmission into
the validation information node. If $\alpha$ was predetermined and the test
statistic $X^*$ fell into the critical region, i.e., P was less than $\alpha$,
then based on that particular test the decision that the model be
accepted cannot be endorsed. The decision to reject or accept a model
could be thought of as an n dimensional vector of which the test result
is merely one component.
The types of statistical test previously used for model validation,
and most likely to continue being used, are both parametric and non-
parametric. Before describing several of these tests, a description of
their inherent differences should be useful.


A. PARAMETRIC AND NONPARAMETRIC TESTS 

A parametric statistical test is a test in which specific assump-
tions, such as $\mu = 0$ or $\mu = \mu_0$, about the parameters of the sampled
population are made, whereas a nonparametric test, as the name implies,
makes no assumptions about the value of the parameters in the sampled
population but rather assumes only that a distribution exists. Another
term often used interchangeably with nonparametric is distribution free.
A distribution free test differs from both the parametric and nonpara-
metric tests in that it makes no assumptions about the form of the
sampled distribution. In this paper the terms nonparametric and distri-
bution free will be used interchangeably. More important than the dif-
ference in the definitions is the difference in the assumptions which
must be made when testing parametrically versus nonparametrically. To
determine critical values both tests require that the distribution of
the test statistic be fully known. In the case of the parametric tests
this often requires that the sample size be large so that the asymptotic
distribution of the test statistic is known. The distribution, $p_0(x)$,
of the test statistic in the nonparametric case is generally known pre-
cisely and need not be assumed. Other assumptions of the parametric
tests may include independence of observations, underlying normal dis-
tribution of the sampled populations, homoscedasticity or, at least, a
known ratio of variances among populations in the case of a multiple
sample test, and that the data is measured in at least an interval
scale, meaning that operations with the data are isomorphic to arithmetic.
The assumptions associated with nonparametric tests include only that
sampled populations be continuous and, in some cases, be symmetric or
identical. As with parametric tests, the observations are assumed to
be independent. For a more complete discussion of the assumptions see
ee: 

The more practical advantages of the nonparametric tests include 
their intuitive attraction, simplicity of derivation, and ability to 
be understood conceptually. They are often times easier to apply, but 
this quality deteriorates rapidly as the sample size increases past 30. 
Perhaps the largest advantage, however, is their statistical efficiency.*
As Bradley explains [3]: 

    When judged by the mathematical criterion of statistical effi-
    ciency, distribution-free tests are often superior to their
    most efficient parametric counterparts when both tests are
    applied under "nonparametric" conditions, i.e., conditions
    meeting all assumptions of the distribution-free test, but
    failing to meet some of the assumptions of the parametric
    test. When both tests are applied under "parametric" condi-
    tions, i.e., conditions meeting all assumptions of the para-
    metric test, and therefore of both tests, distribution-free
    tests are usually very slightly less efficient at small sample
    sizes, becoming increasingly less efficient as sample size
    increases.


Thus with large samples the parametric tests are more powerful provided
that their assumptions are met. This margin of power enjoyed by the
parametric tests decreases with sample size until the sample size becomes
small enough, 6-10, that the power differential is insignificant. On the
other hand, when the parametric assumptions are falsely made, but the
nonparametric assumptions are not, the nonparametric tests are often
superior.

* Power or statistical efficiency is defined as the ratio of the para-
metric test sample size to the nonparametric test sample size in order
to make the power of the two tests equivalent. If the power efficiency
of a nonparametric test is 96%, then if the more powerful parametric
test has 10 samples the nonparametric test must have only 10/.96 = 10.4
samples to be of equal power.

Since one of the underlying assumptions in validation is that the 
sample size of real world data will be quite small it is necessary to 
consider the effect of the parametric and nonparametric assumptions in 
terms of small samples, i.e., less than 10. Again according to Bradley, 
when the parametric assumptions are violated they have their most 
drastic effect and in addition are most unlikely to be detected due to 
the small sample size. If a parametric test can be used, even though 
it is more powerful, its advantage over the nonparametric test is slight 
due to the small sample size. 

Because of the many facets of both types of tests it would be foolish
to say that only tests of a single type should be used. Equally
foolish would be an attempt to categorize the types of data to be
validated with specific statistical tests. This choice remains in the
decision node of the validation procedure. So rather than attempt such
a recipe it is beneficial to look at what has been done in several
validations and what the differences in critical regions or P values are
when tests requiring various assumptions are performed on the same data.
In order to examine these differences, a sensitivity analysis on two
sets of data with respect to statistical tests and their inherent
assumptions was performed.
V. NONPARAMETRIC TESTS WITH SUBMARINE DATA BASE


The data used in the nonparametric tests of this section was obtained
from a sequence of submarine exercises in which a submarine attempted to
detect another submarine transiting through a defined region. If
detection was made then both the range and aspect of the detected
submarine were recorded. A stern aspect indicates a retreating contact
and the corresponding range would be negative. A positive detection
range indicates a bow aspect at initial detection of an incoming sub-
marine. Thus for a given exercise the data might be: out of 10
possible detections initial detection was made at 8, 3, -6, and 1 miles.
This should be interpreted as follows: the exercise was run 10 times
and detection occurred on 4 runs. On three of these runs the initial
detection was made when the transiting submarine was at a range of 8, 3,
and 1 miles and closing, while on the fourth run detection was made at 6
miles but the range was opening. These exercises were simulated on
the computer and similar results tabulated. Summarizing, the following
constitutes the data base for this validation. The submarine model was
tested under 10 various conditions such as speed and depth. Calling
each set of conditions an input, there are 10 distribution functions
each of which corresponds to an input. For each of the inputs there
are several samples from the real world exercises and many, 100-120,
samples from the computer simulation model. Two measures of effective-
ness have been observed, namely the frequency of detection and range of
initial detection.

Once again, the goal of the statistical module is to determine
$p_0(x)$ and $p_1(x)$, the size of the critical region or P value, and $\beta$,
where $p_0(x)$ is the distribution of the test statistic under the null
hypothesis that the real world and model are sampling from the same
distribution, and $p_1(x)$ is the distribution of the test statistic when
the two distributions are not the same.


A. TESTS

The tests chosen to illustrate what might be done in validating the
submarine model include the nonparametric Kolmogorov-Smirnov Test, the
Fisher Exact Test, the Wilcoxon Test, and the tests used in the initial
validation of this model [24].


B. KOLMOGOROV-SMIRNOV TEST 

Perhaps the most heuristic of the statistical tests is the Kolmogorov-
Smirnov Two Sample Test, often referred to as the Smirnov Maximum Devia-
tion Test. The test statistic is the maximum deviation between the two
empirical cumulative distribution functions.

To compute the test statistic, rank the n real world and m model
observations and give each a subscript corresponding to its rank. For
each possible rank i, i = 1, ..., n+m, calculate $d_i$, where:

    $d_i = \frac{r_i}{n} - \frac{s_i}{m}$

and

    $r_i$ = the number of real world observations less than the ith
            order statistic
    $s_i$ = the number of simulated observations less than the ith
            order statistic.

The test statistic, D, is $\max |d_i|$, i = 1, ..., n+m. Under the hypothesis
that the observations came from the same distribution, the distribution
of D is known and can be calculated for any combination of n+m [4,20].
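The computation of D can be sketched in a few lines. This is a minimal illustration rather than the procedure used in the validation: the function name is hypothetical, the counts are taken as "less than or equal to" each pooled observation (the usual empirical distribution function convention, assumed here), and the sample values are invented.

```python
def ks_two_sample_statistic(real, model):
    """Maximum absolute difference between two empirical distribution functions."""
    n, m = len(real), len(model)
    d_max = 0.0
    # Evaluate both step functions at every distinct pooled observation.
    for value in sorted(set(real) | set(model)):
        r = sum(1 for obs in real if obs <= value)    # real world count
        s = sum(1 for obs in model if obs <= value)   # simulated count
        d_max = max(d_max, abs(r / n - s / m))
    return d_max


# Hypothetical detection ranges in miles.
real_world = [1, 5, 8]
simulated = [6, 10, 12, 14, 20]
print(ks_two_sample_statistic(real_world, simulated))
```

Turning D into a P value still requires the null distribution of D, either exact or approximate, which this sketch does not attempt.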


As an illustration of this test consider the 10th input where n+m
is 106, and n=3. The ranges of real world detections were ranked among
the model detections and the values of i are i = 7, 26, 38. The two step
functions are shown in Table I.


TABLE I

CUMULATIVE STEP FUNCTIONS IN KOLMOGOROV-SMIRNOV
TEST WITH INPUT 10 OF SUBMARINE DATA

[Plot of the two cumulative step functions against rank, horizontal axis 10 to 110.]


D occurs at i = 38 where

    $d_{38} = \frac{3}{3} - \frac{35}{106} = .67$.

Using the approximation to $p_0(x)$, the P value is found to be .2024.
Table II gives a summary of the P values when the test was applied in a
similar fashion with the remaining 9 inputs.

This test has all the previously mentioned advantages of nonpara- 
metric statistics, especially intuitive appeal. It also has the 
advantage of testing for differences in the distributions caused by all 
the properties of the distribution function instead of just the dif- 
ferences in mean or variance. 


A major restriction is placed on the validity of the results by
using the approximation to $p_0(x)$. Hodges [15] has shown that as m and n
increase the approximation may differ significantly from $p_0(x)$; thus not
only does power efficiency decrease with increasing sample size but the
approximation of $p_0(x)$ also becomes less valid. The effect on P of
approximating the distribution function can be shown by comparing the
results of this approximation with those of an exact test. This is
shown in Table XVII, page 45.


TABLE II

SUMMARY OF RESULTS FOR KOLMOGOROV-SMIRNOV
TEST WITH SUBMARINE MODEL DATA

INPUT     P VALUE
  1        .485
  2        .260
  3        .980
  4        .998
  5        .941
  6        .491
  7        .922
  8        (illegible)
  9        (illegible)
C. WILCOXON TEST IN THE ORIGINAL VALIDATION

In the original validation of the submarine encounter model and the
associated data [26], the Wilcoxon Test was used with the initial range
of detection data. In this test the observations from both sources for
a given input are aggregated and ranked in order of magnitude. If both
model and real world are sampling from the same distribution then all
combinations of ranks are equally likely. The smallest and
largest possible rank sums make up the critical region since they
represent the least likely results. For example, suppose the real
world detections occurred at 1, 5, and 8 miles out of 6 runs and out of
20 runs the simulation results had initial detection at 6, 10, 12, 14, and
20 miles. The summed ranks for the real world would be 1+2+4=7, which
falls below the acceptance region of 8-18 given in Table III for 5 model
and 3 real world detections, so
at an alpha level of .109 the difference between real world and model
results is significant.
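The rank sum in this example can be checked with a short sketch. The function name is hypothetical and the code assumes there are no tied observations, which holds for the numbers quoted above.

```python
def real_world_rank_sum(real, model):
    """Sum of the ranks held by the real world observations in the pooled sample.

    Assumes no ties between observations.
    """
    pooled = sorted(real + model)
    return sum(pooled.index(obs) + 1 for obs in real)


# Values from the example in the text (miles).
real_world = [1, 5, 8]
simulated = [6, 10, 12, 14, 20]
print(real_world_rank_sum(real_world, simulated))  # 7
```

The resulting sum is then compared with the acceptance region for the corresponding numbers of model and real world detections.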

Table III lists the acceptance regions in ranked sums for various 
model and real world outputs. In each case the level of significance 
is .109 and the real world ranks are to be summed. Exact P values were 
not found in the original validation. 

There are two difficulties with the original use of the test,
however. In an apparent attempt to avoid the tedious counting proce-
dures outlined in the next section, the model data was divided into
sets of 20 then tested with the set of real world data. Each test was
considered independently, thus the level of significance is considered
to be $(1-.109)^6$ or .5. This is derived by the following argument.
Since $\alpha$ was predetermined as $\alpha \ge .5$ then $(1-P_f)^n \le .5$ where $P_f$ is the
probability of failing the test if the state of nature is $S_0$, and n is
the number of tests. If n = 6 then $\alpha_n$ becomes .109. It seems very dif-
ficult to believe however that these tests are independent if the same
real world observations are to be used in each test. The second dif-
ficulty is the method in which the rank sums were determined. Instead
of considering an initial detection of 8 miles differently than one of
-8 miles, both were given the same rank.
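The arithmetic behind the .109 and .5 figures can be reproduced as follows. This sketch takes the independence of the six tests for granted, which is exactly the assumption questioned above; the function names are illustrative only.

```python
def per_test_level(overall_level, n_tests):
    """Per-test significance level giving the stated overall level
    when n_tests independent tests must all be passed."""
    return 1.0 - (1.0 - overall_level) ** (1.0 / n_tests)


def overall_level(per_test, n_tests):
    """Probability that at least one of n_tests independent tests fails
    when the null hypothesis is true."""
    return 1.0 - (1.0 - per_test) ** n_tests


print(round(per_test_level(0.5, 6), 3))    # about .109
print(round(overall_level(0.109, 6), 3))   # about .5
```

If the six tests share the same real world observations, the true overall level would generally differ from the value this calculation gives.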
TABLE III

RANK-SUM ACCEPTANCE REGIONS WITH $\alpha$ = .109*

                               R2
 R1        2           3           4           5           6

  2        -          7-12       11-18
  3       3-8         7-14     (illeg.)
  4       3-10        8-17     (illeg.)
  5       4-12        8-18       13-26
  6       4-13        9-21       14-29
  7       4-15       10-24       16-33       23-42
  8       5-17       10-26       16-35       24-46
  9       5-19       11-28       18-38       25-49
 10       5-20     (illeg.)      19-41       27-53
 11       6-22       12-32       20-44       29-57       38-70
 12       6-24       13-35       21-47       30-60       40-74
 13       6-25       13-37       22-50       32-64       42-78
 14     (illeg.)     15-40       23-53       33-67       43-82
 15       7-29       15-42       24-56       34-70       45-86
 16       8-31       16-45       25-59       36-73       47-90
 17       8-32       16-46       26-62       37-78       49-95
 18       8-34       17-49       28-65       39-82       51-99
 19       9-36         -         28-67       40-85       53-103
 20        -           -         30-71       41-88       55-107

R1 = No. of detections by model in sample size 20.
R2 = No. of detections in real world exercise, with 6 runs.

* By permission from Submarine ASW Encounter Simulation Model Detection
Validation (U) by Systems Analysis Office, ASW Systems Project Office
(1967).


Both these difficulties are corrected and a more exact test made in
the next section. Table IV presents a summary of results using the
original testing procedure.


TABLE IV

SUMMARY OF THE ORIGINAL WILCOXON TEST
RESULTS USING THE SUBMARINE DATA

RUN NUMBER     P VALUE
    1          greater than .5
    2          greater than .5
    3          greater than .5
    4          greater than .5
    5          greater than .5
    6          greater than .5
    7          greater than .5
    8          less than .5
    9          greater than .5
   10          less than .5

D. EXACT WILCOXON RANK SUM TEST 

An improvement over the original use of the Wilcoxon Rank-Sum Test
in validating this model can be made by not dividing the model data
into subsets but rather considering it as a sample of size 120, and by
determining the exact distribution of the Wilcoxon Test statistic. In
order to determine the distribution $p_0(x)$ it is necessary to compute
such things as the number of possible ways 4 numbers can be sampled
without replacement from the positive integers 1 through 124 such that
their sum is always less than or equal to 165. A recursive counting
procedure for large numbers like 165 has been developed by Fix and
Hodges [10] and was used in determining P values for 4 of the 10 inputs.
As an approximation to the test it can be realized that the ranks form
a finite population, thus the expected value and variance of the average
rank sum can be determined exactly. The distribution of the average
value of the observed ranks minus its expected value and divided by its
standard deviation is approximately the unit normal. Kruskal and Wallis
[16] have suggested an additional correction for continuity when using
this approximation.
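The counting problem and the normal approximation described above can both be sketched briefly. The dynamic-programming count below is a generic technique and is not claimed to be the Fix and Hodges recursion; the function names are hypothetical, and the 0.5 continuity correction is the common textbook choice, assumed here rather than taken from Kruskal and Wallis.

```python
from math import comb, erf, sqrt


def count_rank_sets(total_ranks, k, max_sum):
    """Number of k-subsets of {1, ..., total_ranks} whose sum is <= max_sum."""
    # ways[j][s] = number of j-subsets of the ranks seen so far with sum s
    ways = [[0] * (max_sum + 1) for _ in range(k + 1)]
    ways[0][0] = 1
    for rank in range(1, total_ranks + 1):
        # Descending j so that each rank is used at most once per subset.
        for j in range(min(k, rank), 0, -1):
            for s in range(max_sum, rank - 1, -1):
                ways[j][s] += ways[j - 1][s - rank]
    return sum(ways[k])


def lower_tail_exact(total_ranks, k, w):
    """Exact P(rank sum <= w) when all k-subsets of ranks are equally likely."""
    return count_rank_sets(total_ranks, k, w) / comb(total_ranks, k)


def lower_tail_normal(total_ranks, k, w):
    """Normal approximation to P(rank sum <= w) with a 0.5 continuity correction."""
    mean = k * (total_ranks + 1) / 2.0
    var = k * (total_ranks - k) * (total_ranks + 1) / 12.0
    z = (w + 0.5 - mean) / sqrt(var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# The case mentioned in the text: 4 ranks out of 124 with sum <= 165.
print(lower_tail_exact(124, 4, 165), lower_tail_normal(124, 4, 165))
```

The exact routine gives a one-sided tail probability; how the two tails are combined into a P value depends on the test convention being used.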

As an example of the effect of the normal approximation observe the
summary of P values using the Wilcoxon Rank-Sum Test in Table V.

These are the first results based on the exact distribution
$p_0(x)$, but as shown by the results in Table V it is evident that the
approximations are quite close. Again the inherent advantages of the
nonparametric statistics are present but so also is the lack of knowledge
of $p_1(x)$. For a more complete discussion of the efficiency of this
test see [5].


E. BINOMIAL TEST IN THE ORIGINAL VALIDATION 

The other measure of effectiveness used in the statistical portion
of the validation of this model is the probability of detection, $P_d$.
In the original validation [25] a very inexact test was used. If m
model runs and n real world runs were to be compared then the probabili-
ties of each possible outcome were estimated. These probabilities are
shown in Table VII for n equal 4 and m equal 20.

To see the inexactness of this test consider how the probabilities
are determined. The probability of x detections in n runs is

    $\binom{n}{x} P_d^{x} (1-P_d)^{n-x}$.


TABLE V

SUMMARY OF SUBMARINE MODEL P VALUES WITH
DETECTION RANGE DATA USING THE WILCOXON RANK-SUM TEST

INPUT     EXACT          APPROXIMATION
  1                      .3140
  2       .0066          .009322
  3                      .7900
  4                      .7860
  5       (illegible)    .8728
  6       (illegible)    .2448
  7                      .5666
  8                      .3600
  9                      .2684
 10       .0952          .0902

TABLE VI

SUMMARY OF P VALUES USING NONPARAMETRIC
TESTS ON SUBMARINE MODEL RANGE OF DETECTION DATA

INPUT     K-S TEST       WILCOXON RANK SUM TEST        ORIGINAL RANK
                                                       SUM TEST
          APPROX.        EXACT          APPROX.        APPROX.
  1       .485                          .3140          > .5
  2       .260           .0066          .009322        > .5
  3       .980                          .7900          > .5
  4       .998                          .7860          > .5
  5       .941           (illegible)    .8728          > .5
  6       .491           (illegible)    .2448          > .5
  7       .922                          .5666          > .5
  8       (illegible)                   .3600          < .5
  9       (illegible)                   .2684          > .5
 10       .2024          .0952          .0902          < .5


If the null hypothesis is true and the model and real world runs are
independent then the probability of observing y out of m detections from
the model and x out of n detections from the real world is:

    $P_d(x,y) = \binom{n}{x}\binom{m}{y} P_d^{x+y} (1-P_d)^{m+n-x-y}$

$P_d(x,y)$ is a concave function and by taking derivatives with respect to
$P_d$, it can be shown that if $0 < x+y < m+n$ then:

    $P_d(x,y) \le \binom{n}{x}\binom{m}{y} \left(\frac{x+y}{m+n}\right)^{x+y} \left(1-\frac{x+y}{m+n}\right)^{m+n-x-y} = P(x,y)$

Thus P(x,y) is an upper bound on the probability that x and y detections
will occur. P(x,y) are the values listed in Table VII.
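The bound can be evaluated directly, and its relation to the exact joint probability checked numerically. The function names below are illustrative; the sketch simply substitutes the pooled detection fraction (x+y)/(m+n) for the unknown detection probability.

```python
from math import comb


def joint_probability(x, y, n, m, p_d):
    """Probability of x detections in n real world runs and y in m model runs
    when both share the same detection probability p_d."""
    return (comb(n, x) * comb(m, y)
            * p_d ** (x + y) * (1.0 - p_d) ** (n + m - x - y))


def upper_bound(x, y, n, m):
    """Largest value of the joint probability over all p_d, attained at the
    pooled detection fraction (x + y) / (n + m)."""
    return joint_probability(x, y, n, m, (x + y) / (n + m))


# x = 1 real world detection in 4 runs, y = 0 model detections in 20 runs.
print(round(upper_bound(1, 0, 4, 20), 6))
```

For any particular detection probability the exact joint probability is no larger than this value, which is why a critical region assembled from these bounds is smaller than its nominal size suggests.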

The data has again been divided into groups of 20 and thus the
critical region reduced to .109 for each test. For the particular values
of n and m the critical region has been partitioned in Table VII. Should
a pair (x,y) fall into this region for any of the six tests then the
hypothesis is rejected at the .5 significance level.

The primary objection to the test, besides the division of model
observations into groups of 20, is the fact that each P(x,y) is equal
to or larger than its exact value yet the size of the critical region
is still assumed to be .109. This would seem to indicate that when the
null hypothesis is accepted using this test, it might be rejected when
using a more exact test. This is in fact the case as shown in Table
XI, page 37.


TABLE VII

P(x,y) VALUES FOR n = 4 AND m = 20*

                              x
  y        0         1         2         3         4

  0        -      .062623   .006144   .000474   .000021
  1     .313114   .081919   .014194   .001611   .000093
  2     .194556   .089891   .022945   .003524   .000262
  3     .134835   .091778   .031708   .006277   .000583
  4     .097514   .089837   .040012   .009900   .001125
*********
  5     .071870   .085359   .047517   .014391   .001973
  6     .053350   .079194   .053965   .019721   .003230
  7     .039598   .071953   .059163   .025835   .005023
  8     .029231   .064093   .062972   .032647   .007509
*********
  9     .021365   .055975   .065294   .040045   .010883
 10     .015393   .047883   .066074   .047883   .015393
 11     .010883   .040045   .065294   .055975   .021365
*********
 12     .007509   .032647   .062972   .064093   .029231
 13     .005023   .025835   .059163   .071953   .039598
 14     .003230   .019721   .053965   .079194   .053350
 15     .001973   .014391   .047517   .085359   .071870
*********
 16     .001125   .009900   .040012   .089837   .097514
 17     .000583   .006277   .031708   .091778   .134835
 18     .000262   .003524   .022945   .089891   .194556
 19     .000093   .001611   .014194   .081919   .313114
*********
 20     .000021   .000474   .006144   .062623      -

NOTE: 1. Table entries represent the equation:

$$P(x,y) = \binom{4}{x}\binom{20}{y} \left(\frac{x+y}{24}\right)^{x+y} \left(1-\frac{x+y}{24}\right)^{24-(x+y)}$$

2. P(x,y) values lying between shaded (****) regions define the
acceptance region.

* By permission from Submarine ASW Encounter Simulation Model Detection
Validation (U) by Systems Analysis Office, ASW Systems Project Office
(1967).

TABLE VIII

SUMMARY OF THE ORIGINAL BINOMIAL
TEST USING THE SUBMARINE P_d DATA

RUN NUMBER     P VALUE

  1            greater than .5
  2            greater than .5
  3            greater than .5
  4            greater than .5
  5            greater than .5
  6            greater than .5
  7            greater than .5
  8            less than .5
  9            greater than .5
 10            less than .5

F. FISHER EXACT TEST

When using the number of detections divided by the number of runs
to test model validity, tests having more exact knowledge of p_0(x) are
also available. One such test is the Fisher Exact Test based on the
hypergeometric distribution. The P value is determined by computing
the probability of receiving the exact combination of model and real
world detections as well as any of the more extreme combinations.
Consider the data as presented in Table IX.


TABLE IX

FISHER EXACT TABLEAU WITH
SUBMARINE DATA FROM INPUT 7

               NUMBER OF      NUMBER OF
               DETECTIONS     NON-DETECTIONS     TOTAL

REAL WORLD          3               5               8
MODEL              80              40             120
TOTAL              83              45             128

The probability of receiving this combination of detections and
non-detections is:

$$\frac{83!\,45!\,8!\,120!}{128!\,3!\,5!\,80!\,40!} = .07905$$


For a proof see [6]. 
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The more unlikely combinations, keeping the totals fixed, and their 


probabilities are listed in Table X.

TABLE X
MORE EXTREME TABLEAUS IN FISHER EXACT TEST FROM INPUT 7

               DET.   NONDET.      DET.   NONDET.      DET.   NONDET.

REAL WORLD       2       6           1       7           0       8
MODEL           81      39          82      38          83      37

PROBABILITY    [illegible]         [illegible]         .0001518


The sum of all these probabilities is .10183; but this represents the 
critical region in only one tail of p_0(x) and since the alternate 
hypothesis is compound the sum must be doubled. P is therefore .20277. 
The results of this test with all 10 inputs are listed in Table XI. 
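
The tableau probabilities can be recomputed directly from the
hypergeometric distribution. The sketch below is illustrative only: the
function names are hypothetical, and because it uses exact integer
arithmetic its output may differ somewhat from the hand-computed
figures printed in the text.

```python
from math import comb


def tableau_probability(real_det, real_total, total_det, grand_total):
    """Hypergeometric probability of one 2x2 tableau with all margins fixed."""
    total_nondet = grand_total - total_det
    real_nondet = real_total - real_det
    return (comb(total_det, real_det) * comb(total_nondet, real_nondet)
            / comb(grand_total, real_total))


def lower_tail(real_det, real_total, total_det, grand_total):
    """Sum over the observed tableau and those with fewer real world
    detections, keeping the totals fixed."""
    smallest = max(0, real_total - (grand_total - total_det))
    return sum(tableau_probability(k, real_total, total_det, grand_total)
               for k in range(smallest, real_det + 1))


if __name__ == "__main__":
    # Table IX: 3 of 8 real world detections, 83 detections in 128 runs.
    print(tableau_probability(3, 8, 83, 128))
    one_tail = lower_tail(3, 8, 83, 128)
    print(one_tail, 2 * one_tail)  # one tail, then the doubled P value
```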

These exact probabilities can be very tedious to compute and again
approximations are available. A normal approximation when the sample
size is large is described by Brownlee [7], along with guidelines on
when the approximation is valid.* Unfortunately none of the input
results met the criterion but in three cases they came reasonably close.
The results of using the approximations are listed in Table XI.

Along with the standard attributes of nonparametric statistics
the Fisher Exact Test and its normal approximation both have well
defined power functions, 1-β, associated with them. 1-β for input 1
was computed using the methods suggested by Brownlee [8] and is listed
in Table XI as .529.








* Working in reverse it has been shown by Tocher [27] that the Fisher
Exact Test can be used when the conditions of the normal approximations
do not hold.

TABLE XI

SUMMARY OF SUBMARINE MODEL P VALUES
USING THE FREQUENCY OF DETECTION DATA

          FISHER EXACT    NORMAL                     ORIGINAL
INPUT     TEST            APPROXIMATION    POWER     BINOMIAL

  1       .954            .928             .529      > .5
  2       [illegible]                                > .5
  3       [illegible]     .668                       > .5
  4       .894                                       > .5
  5       .968                                       > .5
  6       1.000           .984                       > .5
  7       .203                                       > .5
  8       .031                                       < .5
  9       .266                                       > .5
 10       .101                                       < .5


For information concerning the power function and the Fisher Exact Test
see [1,14,19]. Thus for the first time in the tests described, p_0(x)
and p_1(x) can be found.


G. SUMMARY OF TESTS WITH SUBMARINE DATA 

This concludes a far from exhaustive presentation of possible
statistical tests which could be used in validating the submarine model.
Hopefully the types of assumptions that are necessary in nonparametric
testing are reasonably clear. Brownlee, Bradley, and Siegel give a far
more in-depth discussion of nonparametric statistics in their texts
referenced in this section. For a more complete discussion of the
power of nonparametric tests see [9] as well.

Before going on to a parametric test and one where dependence among
samples is considered, examine the difference in the critical regions
obtained by using various tests requiring slightly different assumptions
about the distribution of the test statistic (Tables VI and XI, pp. 32
and 37). While care must be used in explaining the cause of the
differences, it is certainly safe to say that the assumptions of the
Wilcoxon Rank Test are most closely adhered to while the ranking
procedure of the original rank sum test and the approximation of p_0(x)
in the Kolmogorov-Smirnov Test would tend to discount the validity of
their results.


VI. PARAMETRIC AND NONPARAMETRIC TESTS WITH AIRCRAFT DATA BASE 


The data to be used in the statistical tests of this chapter was
obtained from eight independent aircraft-submarine exercises. In each
exercise aircraft monitored a string of eight sonobuoys in an attempt
to gain and maintain the detection of a transiting submarine. All
exercises were made under similar conditions and therefore the
conditions can be considered identical. The measure of effectiveness in
these exercises is detection modulus or probability of detection.
Detection modulus, D.M., is computed by dividing the total number of
minutes detection was held by the total number of minutes detection
could have been held.

$$\text{D.M.} = \frac{\text{TIME DETECTION WAS HELD}}{\text{TIME DETECTION COULD HAVE BEEN HELD}}$$

For each of the eight runs the range and aspect of initial detection
for each buoy was tabulated. Also recorded were the range and aspect
at the time of losing contact, and the same information in the event
that contact was regained. Because of this extensive data base there
are several random variables which might be tested. All of these fall
into two categories, however: those in which the assumption is made
that the samples are independent and identically distributed, and those
which assume only that the samples are identically distributed. The
Paired t, Wilcoxon, and Kolmogorov-Smirnov Tests fall into the first
category while the Davisson Test falls into the latter.

Thus using detection modulus as a measure of effectiveness the
nonparametric Wilcoxon and Kolmogorov-Smirnov Tests and the parametric
Paired t and Davisson Tests will be used to demonstrate the spectrum of
tests and their characteristics that might be used in validating a model
with this type of data base.


A. PAIRED t TEST 

Since we are testing the hypothesis that the real world and the model
are sampling from the same distribution it is only natural to compare the
differences in their outputs. Let d_i represent the difference between
the real world and model detection moduli for run i, and let

$$\bar{D} = \frac{1}{n} \sum_{i=1}^{n} d_i$$

If s^2 is defined as

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{D})^2$$

then it can be shown [13] that

$$t = \frac{\sqrt{n}\,\bar{D}}{s}$$

is asymptotically t distributed with n-1 degrees of freedom.

Since this test assumes that the d_i are independent, the sample size
is eight and each sample is the difference between the real world and
model estimate of the true detection modulus computed with all eight
buoys operating in concert. Table XII shows the actual data and part
of the calculations. The corresponding P value is approximately .42.

The Paired t Test has the advantages of the parametric tests, and
p_0(x) is known exactly to be t(n-1) for large n. Since this distribution
is well tabulated and the arithmetic is basic, the test is easy to
perform. The test does make some very restrictive assumptions. The
independence assumption forces the aggregation of the data to the extent
that much information may be lost. The asymptotic property of the
distribution of the test statistic adds another degree of complexity
for no longer is p_0(x) known exactly. It is known only in the limit
as n increases.


TABLE XII
INDEPENDENT AIRCRAFT DATA FOR PAIRED t TEST

RUN    REAL WORLD    MODEL       d_i       d_i - \bar{D}
       D.M.          D.M.

 1     .4865         .3478       .1387       .1748
 2     .3587         .4023      -.0436      -.0075
 3     .2500         .4711      -.2211      -.1850
 4     .4134         .5034      -.0900      -.0539
 5     .1729         .2967      -.1238      -.0877
 6     .3884         .4126      -.0242       .0119
 7     .2601         .2517       .0084       .0445
 8     .3171         .2702       .0469       .0830

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{8} (d_i - \bar{D})^2 = .012635$$

$$t = \frac{\sqrt{n}\,\bar{D}}{s} = \frac{\sqrt{8}\,(-.0361)}{.1124}$$

$$t = -.9084$$
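
The arithmetic of the Paired t Test can be sketched as follows. This is
an illustration only: the function names are hypothetical, and the
variance is assumed to use the divisor n-1, which is the usual
convention for n-1 degrees of freedom.

```python
from math import sqrt


def paired_t_statistic(differences):
    """t = sqrt(n) * mean(d) / s, with s^2 computed using divisor n-1."""
    n = len(differences)
    mean_diff = sum(differences) / n
    s_squared = sum((d - mean_diff) ** 2 for d in differences) / (n - 1)
    return sqrt(n) * mean_diff / sqrt(s_squared)


def t_from_summary(n, mean_diff, s):
    """Same statistic when only the summary values are at hand."""
    return sqrt(n) * mean_diff / s


if __name__ == "__main__":
    # Summary values quoted with Table XII: n = 8, mean = -.0361, s = .1124.
    print(round(t_from_summary(8, -0.0361, 0.1124), 4))
```

The summary form reproduces the last step of the calculation shown with
Table XII; the first form starts from the individual differences.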


B. KOLMOGOROV-SMIRNOV TEST 

Another way of comparing the differences in the samples is by the
Kolmogorov-Smirnov Test. The relative merits of the test have been
discussed, but this data presents an opportunity to compare the exact
p_0(x) for the Kolmogorov-Smirnov Test to the previously used
approximation of p_0(x). Using the data given in Table XII, the maximum
deviation is .25, and by the previous approximation to p_0(x) the
corresponding P value is .9639 whereas by Massey's exact computation
[18] the P value is .6602. This very large difference indicates the
dangers in using this approximation to p_0(x).
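
The maximum deviation and the approximate P value can be recomputed
from the detection moduli of Table XII. The sketch below is a hedged
illustration: the function names are hypothetical, and it assumes that
the approximation referred to is the usual large-sample Kolmogorov
series, an assumption that happens to give a value close to .9639.

```python
from math import exp, sqrt

REAL_WORLD = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
MODEL = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical distribution functions."""
    def ecdf(sample, t):
        return sum(v <= t for v in sample) / len(sample)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, t) - ecdf(sample_b, t)) for t in points)


def ks_asymptotic_p(deviation, n, m, terms=100):
    """Large-sample series 2 * sum (-1)^(k-1) exp(-2 k^2 lambda^2)."""
    lam = deviation * sqrt(n * m / (n + m))
    total = sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    d = ks_statistic(REAL_WORLD, MODEL)
    print(d, ks_asymptotic_p(d, len(REAL_WORLD), len(MODEL)))
```

The exact P value depends on tabulated small-sample distributions such
as Massey's and is not attempted here.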
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C. WILCOXON TEST WITH NORMAL APPROXIMATION 

The Wilcoxon Test when performed on the data of Table XII and with
use of the normal approximation to p_0(x) yields a P value of .46. The
relative merits of this test are the same as described previously, and
the results are given only for comparative purposes.
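
A short sketch of the calculation follows. It assumes that the test
meant here is the two-sample rank-sum form with a two-sided normal
approximation and no continuity correction; under that assumption the
Table XII data give a value near .46. The function name is
hypothetical.

```python
from math import erf, sqrt

REAL_WORLD = [.4865, .3587, .2500, .4134, .1729, .3884, .2601, .3171]
MODEL = [.3478, .4023, .4711, .5034, .2967, .4126, .2517, .2702]


def rank_sum_normal_p(sample_a, sample_b):
    """Rank sum of sample_a and its two-sided normal-approximation P value.
    Assumes there are no tied observations."""
    pooled = sorted(sample_a + sample_b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    w = sum(rank[value] for value in sample_a)
    n, m = len(sample_a), len(sample_b)
    mean = n * (n + m + 1) / 2
    variance = n * m * (n + m + 1) / 12
    z = (w - mean) / sqrt(variance)
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))  # standard normal CDF at |z|
    return w, 2.0 * (1.0 - phi)


if __name__ == "__main__":
    print(rank_sum_normal_p(REAL_WORLD, MODEL))
```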


D. DAVISSON TEST WITH DEPENDENCE 

In the past three tests the sample size was eight due to the fact 
that independence among samples was required by each test. What if 
one wanted to compare the average detection modulus of each buoy on 
each run or perhaps the average detection modulus in each five mile 
range band from -50 to 50 miles for each buoy on each run? In these 
cases and the many others that might be considered the values are 
dependent on each other and thus none of the assumptions of the tests 
mentioned so far are completely satisfied. 

Davisson has shown that by comparing certain differences between the
real world and model results the maximum likelihood ratio yields a
statistic with a known distribution [11].

Since the test is very tedious only a relatively short comparison 
with the aircraft data will be given. Consider the detection moduli 
of buoys 3, 4, and 5 on each run. The random variable to be tested is 
the average detection modulus of each buoy. Thus the null hypothesis 
is that the average detection moduli for buoys 3, 4, and 5 are the same 
in the real world as they are in the model and that their interdependence 
is also identical. 

The first step in determining the test statistic is to compute the
variance-covariance matrix of the computer's average detection moduli.
To do this the average for each buoy over all runs is subtracted from
the average detection modulus for that buoy on each run. The results
are shown in Tables XIII and XIV.


TABLE XIII
AVERAGE BUOY DETECTION MODULI FOR THE MODEL AND THEIR AVERAGES

                        BUOY
RUN            3           4           5

 1          .240755     .560000     .419032
 2          .365082     .457377     .595555
 3          .104921     .596825     .514203
 4          .584210     .941052     .782857
 5          .051273     .164909     .158269
 6          .496562     .357031     .181154
 7          .038730     .465397     .579434
 8          .209259     .563148     .513929
AVERAGE     .261349     .513217     .468054

TABLE XIV
AVERAGE MODEL DETECTION MODULI MINUS THEIR AVERAGES

                        BUOY
RUN            3           4           5

 1         -.020594     .046783    -.049022
 2          .103733    -.055840     .127501
 3         -.156428     .083608     .046149
 4          .322862     .427835     .314803
 5         -.210076    -.348308    -.309785
 6          .235214    -.156186    -.286900
 7         -.222619    -.047820     .111380
 8         -.052090     .049931     .045875

The transpose of the 8x3 matrix in Table XIV when multiplied by
itself and divided by the number of runs yields the variance-covariance
matrix Q. The Q matrix is shown in Table XV.


TABLE XV 
VARIANCE-COVARIANCE MATRIX Q 


.036453 .020347 .0098831 
.020347 .043229 .034850 
.0098831 .034850 .039085 


Now a difference vector M is computed with each component being equal
to the difference between the average real world detection modulus
over all eight runs and the corresponding results from the model.


TABLE XVI
DIFFERENCE VECTOR M

REAL WORLD       MODEL
AVERAGE          AVERAGE         M

.449466          .261349         .1881
.365698          .513217        -.1475
[illegible]      .468054        [illegible]


Davisson has stated [11] that the distribution of

$$M^{T} Q^{-1} M$$

is asymptotically chi-squared with N degrees of freedom where N is the
dimension of Q. In this case M^T Q^{-1} M is 9.7682 and the corresponding
P value is .02.
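
The matrix steps can be sketched as below. This is an illustration
under stated assumptions, not Davisson's own procedure: the function
names are hypothetical, Q is taken to be the cross-product of the Table
XIV deviations divided by the eight runs (which reproduces Table XV),
and the statistic is taken to be the quadratic form of M with the
inverse of Q. The third component of M is not legible in this copy, so
the sketch stops short of recomputing 9.7682 and only evaluates the
chi-squared tail for that quoted value.

```python
from math import erfc, exp, pi, sqrt

# Deviations of the model detection moduli for buoys 3, 4 and 5 (Table XIV).
DEVIATIONS = [
    [-.020594, .046783, -.049022], [.103733, -.055840, .127501],
    [-.156428, .083608, .046149], [.322862, .427835, .314803],
    [-.210076, -.348308, -.309785], [.235214, -.156186, -.286900],
    [-.222619, -.047820, .111380], [-.052090, .049931, .045875],
]


def covariance(rows):
    """Cross-product of the deviations divided by the number of runs."""
    n, k = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)]
            for i in range(k)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Q is singular or nearly so")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def quadratic_form(m, q):
    """M' Q^-1 M, computed without forming the inverse explicitly."""
    return sum(mi * xi for mi, xi in zip(m, solve(q, m)))


def chi2_sf_3dof(x):
    """Upper tail of the chi-squared distribution with 3 degrees of freedom."""
    return erfc(sqrt(x / 2.0)) + sqrt(2.0 * x / pi) * exp(-x / 2.0)


if __name__ == "__main__":
    for row in covariance(DEVIATIONS):
        print([round(v, 6) for v in row])
    print(round(chi2_sf_3dof(9.7682), 3))
```

Solving the linear system rather than inverting Q is one way of
guarding against the loss of accuracy mentioned in the discussion of
this test.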

As was the case with the Paired t Test, this test has the advantages
of being parametric but the disadvantages of its asymptotic properties
and lack of knowledge of p_1(x). The main drawback of the Davisson Test
is its computational difficulty. As the dimension of Q increases a
large computer becomes necessary and the sorting of data becomes quite
tedious. Care must also be taken that accuracy is not lost in the
inversion of Q and that subsets are chosen such that Q is not singular.

In spite of all these disadvantages, the relief from the independence
assumption is very advantageous. If tolerance of its assumptions
permits its use, the Davisson Test will yield a more detailed
validation test. It is now possible to reject part of the model while
accepting the rest, thus allowing trouble-shooting for the simulation
analysts. This feature was also possible with the submarine model but
only because 10 different inputs were sampled and thus data collection
had to be more extensive and also more costly.


E. SUMMARY OF TESTS WITH AIRCRAFT DATA 
The P values corresponding to each of the four tests applied to the
aircraft data are listed in Table XVII.

TABLE XVII

SUMMARY OF P VALUES USING PARAMETRIC AND
NONPARAMETRIC TESTS ON THE AIRCRAFT MODEL DATA

PAIRED t     KOLMOGOROV-SMIRNOV       WILCOXON     DAVISSON
             EXACT       APPROX.

  .42        .6602       .9639          .46          .02


It is not appropriate to compare the results of the Davisson Test
to those of the other tests due to its unique properties, nor is it
feasible to pass judgement on the remaining tests solely on the results
in Table XVII. It should be noted however that the Kolmogorov-Smirnov
and Wilcoxon Test results are based on exact knowledge of p_0(x) while
the Paired t Test and the approximate Kolmogorov-Smirnov Test are not,
and that no additional knowledge of p_1(x) is obtained by using these
approximations. While the distribution of the Davisson Test statistic
is not exact nor is information about p_1(x) available, it does allow a
more localized validation thereby allowing "trouble-shooting" which
the other tests do not permit.

As with the submarine data, these tests are far from an exhaustive
set of all those possible. They were chosen to represent the range and
spectrum of assumptions needed to perform the validation of this type
of model with its data base.
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VII. SUMMARY AND CONCLUSIONS 


This paper has investigated the most salient problems of present day 
validation procedures and alleviated them by enlarging the scope of 
validation and by describing what is needed and can be expected from 
a statistical test with "validation type" data. It was shown that 
decision theory and cost analysis while present in previous validations 
received no mention, and that statistical testing with its pass or fail 
results did not allow the decision maker much flexibility. While only 
two simple decision rules and one type of decision criterion were 
presented, it became obvious that by determining P values from several 
tests and by trying to do such things as minimizing expected cost, the 
decision maker could avail himself of more information and have the 
capability to change more elements in his decision rule. 

A general methodology for the statistical testing of validation 
data was also discussed. Included in the methodology are the goals of 
a "validation test," the types of tests available with their inherent 
assumptions and properties, the need for multiple testing, and the 
pitfalls of relaxing assumptions within a test. 

It was seen that while a myriad of possible tests exists, those
having exact knowledge of p_0(x) and p_1(x) will be the best. But, since
p_1(x) is seldom known due to the nature of the alternate hypothesis and
calculation procedures necessitate approximations to p_0(x) in many
cases, these desirable tests are not always available. Some tests are
clearly better than others, but in general, it was seen that several
tests using different assumptions should be used to achieve the most
reliable information about P and β.
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In conclusion the problems of validation are analogous to those of 
systems analysis and cost-effectiveness. The goal or criterion can be 
defined as minimization of expected cost for a fixed level of validity, 
yet the methods of exact determination are not as well defined and need 
to be considered in concert instead of individually. In the past, one 
of the methods was statistical testing. When used alone there existed 
reasons to criticize the validations but when used in the procedure as 
presented in this paper, the validator has more flexibility and is able 
to use more information from his data and other sources. 

Another important advantage of this procedure is the increased 
ability to see the effects of changes in a decision rule. All that 
could be seen previously was that at a significance level of .6 the 
model was considered invalid but at a .4 level it was not. Now such 
things as the changes in a decision rule caused by refusing to accept 
a priori knowledge of the states of nature can be observed. 

So just as was done with systems analysis a new approach or way of
looking at a problem has been proposed. This time it is to help the
decision maker with his important and complex problems of model
validation.
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VIII. AREAS FOR FUTURE STUDY 


Since this paper represents a pilot study in the expansion of model
validation, almost any facet of the paper could and should be expanded.
The area of simulation theory is normally not considered an O.R.
problem at least in the context of calibrating the model. The search
for more nearly perfect statistical tests is also considered as second
in importance to the development of decision rules applicable to model
validation.

After several decision rules have been presented then case studies
similar to those of systems analysis will make a valuable contribution
to the field of validation.
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